7  Chi-square tests

Our last set of statistical tests are the chi-square tests. These tests examine either the association between two categorical variables, or how well one categorical variable matches a set of specified proportions.

When you are comparing two categorical variables, this is called a chi-square test of independence.

When you are comparing one categorical variable with specific proportions, this is called a chi-square goodness of fit test.

8 Chi-square Goodness of Fit Test - Different Proportions


For our Chi-square Goodness of Fit Test, we can refer back to the living_well dataset we utilised in the previous chapter. Recall that baseline data were collected between October 2014 and June 2017 from the Living Well Multicultural–Lifestyle Modification Program, a tailored healthy lifestyle program. People living in ethnic communities in Queensland who were ≥18 years old and not underweight were eligible to participate.

According to the Australian Institute of Health and Welfare, the distribution of BMI for females aged 18 and over as of 2022 is:

  • 2.23% underweight

  • 36% normal weight

  • 29.6% overweight

  • 17.4% obese (class 1)

  • 8% obese (class 2)

  • 5.4% obese (class 3)

As we are given BMI and weight class for each individual, a potential research question we can ask is whether participation in the Living Well Multicultural Program led to a post-program BMI distribution that differs from the national BMI distribution of adult Australian females as reported by the AIHW (2022).

We can combine Obese classes 1, 2, and 3, as our dataset does not differentiate between them; together these account for 30.8%. Our dataset also has no underweight class, since underweight individuals were ineligible to participate. With that category removed, the remaining national proportions no longer sum to 1.

To address this, we simply normalise them by dividing each proportion by the total of the three raw proportions. (If your proportions already sum to 1, this step is not required.)

raw <- c(Normal = 0.36, Overweight = 0.296, Obese = 0.308) 
prop <- raw / sum(raw) 
prop 
    Normal Overweight      Obese 
 0.3734440  0.3070539  0.3195021 

Step 1: Hypotheses

We can create our hypotheses:

\(H_0:\) The distribution of BMI categories (normal weight, overweight, obese) among female participants in the Living Well Multicultural Program after completing the program matches the national distribution of adult Australian females who are not underweight (normal = 0.373, overweight = 0.307, obese = 0.320).

\(H_A\): The distribution of BMI categories among female participants after the program differs from the national distribution.

Mathematically, we can write the null hypothesis as:

\(H_0: p_{normal}=0.373\) , \(p_{overweight}=0.307\) , \(p_{obese}=0.320\)

We can’t write \(H_A\) as an equation, but note that it means at least one proportion differs from these null proportions.

Step 2: Conditions

We require independence and a sufficient sample size: the chi-square test needs independent observations and adequate expected counts (\(\geq 5\)). In this study, data were collected from individual participants through face-to-face measurements and structured questionnaires at baseline, week 8, and week 14, administered by trained multicultural health workers.

Each participant provided their own responses and measurements, so the observations can reasonably be treated as independent at a single time point.

For sample size, the expected counts in each BMI category must be at least five. Note - We’ll come back to this in a few minutes!

Step 3: Run the test

We can now load in the dataset. Take a moment to attempt this step yourself by loading the data and the appropriate packages.

# Load in tidyverse package
library(tidyverse)
# Load in ggmosaic package
library(ggmosaic)
# Load in tidyr
library(tidyr)
# Load in data set 
living_well <- read.csv("Data-sets/Cleaned_data_pre_post.csv", 
                   stringsAsFactors = TRUE)

How can we filter the data to keep only the time point we are interested in (post week 8) and female participants? You can also select the variables we are interested in.

  1. What function do we use to filter data?
  2. What function do we use to select certain categories in our data?

For this exercise, filter by Time and Gender. You should also select ID, BMI, BMI_cate1, and BMI_cate2.

post_only <- living_well %>%
  filter(
    Time == "Post week8",
    Gender == "Female") %>%
  select(ID, BMI, BMI_cate1, BMI_cate2)

We can now remove any missing values from the relevant columns. In this dataset, the value 999 indicates a missing entry.

  1. What function allows us to modify or create variables?
  2. What function converts the value 999 into a missing value?
  3. What function removes rows containing missing values?

Note that we can use droplevels to remove any unused factor levels that remain after filtering.

post_only <- post_only %>%
  mutate(
    BMI_cate1 = na_if(BMI_cate1, "999"),
    BMI_cate2 = na_if(BMI_cate2, "999")
  ) %>%
  drop_na(BMI_cate1, BMI_cate2) %>%
  droplevels()

Next, we create a table of observed counts for each BMI category in the dataset and relevel the factor so that the category order matches the order of the expected proportions.

  1. What function creates a table?

levels(post_only$BMI_cate2)
[1] "Normal weight" "Obese"         "Overweight"   

  2. What order should our labels be in? Please list the labels separated by commas (for example: x, y, z). You can refer to the labels above.
counts <- table(factor(
  post_only$BMI_cate2,
  levels = c("Normal weight", "Overweight", "Obese")
))
counts

Normal weight    Overweight         Obese 
          117           137           138 

We now need to ensure that the expected proportions are in the same order as the BMI category levels in our dataset.

We can now run the chi-square goodness of fit test.

ChiT <- chisq.test(counts, p = prop)
ChiT

    Chi-squared test for given probabilities

data:  counts
X-squared = 9.4985, df = 2, p-value = 0.008658

Based on the output above:

  1. What is the degrees of freedom?
  2. What is the test statistic?
  3. Should we reject or fail to reject the null hypothesis?

Using the output shown, try writing a conclusion.

Our output (\(\chi^2 = 9.4985\), \(df = 2\), \(p = 0.008658\)) provides us with enough evidence to reject the null hypothesis and conclude that the BMI distribution of female participants after completing the program differs from the national distribution.

A couple of notes:

  • We are able to make this conclusion as the p-value is less than 0.05.

  • The df (degrees of freedom) in this case is calculated as the number of categories minus 1. Given that three different BMI categories are being examined, df = 3 - 1 = 2.

  • RStudio can calculate the expected counts for you:

ChiT$expected
Normal weight    Overweight         Obese 
     146.3900      120.3651      125.2448 

As the expected counts are all much greater than 5, the sample size condition for the chi-square goodness of fit test has been satisfied.
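As a sanity check, the test statistic can be reproduced by hand. A minimal sketch, re-entering the observed counts and null proportions used in this section:

```r
# Observed counts and null proportions from this section
obs  <- c(Normal = 117, Overweight = 137, Obese = 138)
prop <- c(0.36, 0.296, 0.308) / sum(c(0.36, 0.296, 0.308))

# Expected counts: total sample size times each null proportion
exp_counts <- sum(obs) * prop

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((obs - exp_counts)^2 / exp_counts)
round(chi_sq, 4)   # 9.4985, matching the output above
```

The expected counts computed here also match `ChiT$expected` (146.39, 120.37, 125.24).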

If you were interested in which BMI categories contributed to the significant result you could use:

ChiT$stdres

Normal weight    Overweight         Obese 
    -3.068771      1.821457      1.381636 

This command calculates the standardised residual for each category.

As a rule of thumb, a standardised residual greater than 2 indicates that the count for that category was higher than would be expected if the null proportions were true, while a standardised residual less than -2 indicates that the count was lower than would be expected.
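If you are curious where these values come from, the goodness of fit standardised residuals can be reproduced by hand. A sketch, using the same observed counts and null proportions as before:

```r
# Same observed counts and null proportions as in this section
obs  <- c(Normal = 117, Overweight = 137, Obese = 138)
prop <- c(0.36, 0.296, 0.308) / sum(c(0.36, 0.296, 0.308))
exp_counts <- sum(obs) * prop

# Standardised residual for each category:
# (observed - expected) / sqrt(expected * (1 - null proportion))
stdres <- (obs - exp_counts) / sqrt(exp_counts * (1 - prop))
round(stdres, 3)   # -3.069  1.821  1.382, matching ChiT$stdres
```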

9 Chi-square Test of Independence


We will return to the 2023 Early Childhood Education and Care (ECEC) survey, which we first examined when learning about t-tests. This survey gathered responses from roughly 2,000 NSW parents and carers during January and February 2023, and aimed to understand which policy options families value most and what barriers influence their access to early childhood education and care.

Begin by loading the dataset along with the packages forcats and ggmosaic. Remember, if you have never installed these packages locally, you will need to use install.packages().

# Load in ggmosaic package
library(ggmosaic)
# Load in forcats package
library(forcats)
# Load in data set 
ecec <- read.csv("Data-sets/ecec-survey-2023.csv",
                 stringsAsFactors = TRUE) 

Take some time to think of research questions you might ask about this dataset where two categorical variables are compared.

For this section, we will focus on the research question: Does the distribution of income categories differ between male and female carers?

Step 1: Hypotheses

\(H_0:\) Gender and income level for carers across NSW are independent.

\(H_A:\) Gender and income level for carers across NSW are associated.

(We typically wouldn’t write these hypotheses as equations.)

Step 2: Conditions

What conditions do we need to meet for a chi-square test?

  • independence

  • all expected counts at least 5

Try cleaning the dataset yourself. Filter the variables you need, rename any that will make the analysis clearer, and drop any unused factor levels.

ecec_chi <- ecec %>%
# rename SQ1 to `Gender` and D13 to `Annual_Income` to be more informative
  rename(
    Gender = SQ1,
    Annual_Income = D13
  ) %>%
# select the newly named variables
  select(Gender, Annual_Income) %>%
# drop the missing values from the relevant variables
  drop_na(Gender, Annual_Income) %>%
# filter Gender to include only males and females
  filter(Gender %in% c("Male", "Female")) %>%
# filter out "Prefer not to answer" from `Annual_Income`
  filter(Annual_Income != "Prefer not to answer") %>%
# Drop unused levels
  droplevels() 

Let’s take a look at the structure of our cleaned data.

str(ecec_chi)
'data.frame':   1924 obs. of  2 variables:
 $ Gender       : Factor w/ 2 levels "Female","Male": 2 2 1 1 1 2 1 2 2 1 ...
 $ Annual_Income: Factor w/ 14 levels "$1 - $19,999 per year ($1 - $379 per week)",..: 5 2 3 2 2 13 5 4 2 4 ...

We can use a mosaic plot to visualise the relative proportions across groups. Because some of the labels are quite long, renaming them to more concise terms will help improve the readability of the plot.

ecec_chi3 <- ecec_chi %>%
  mutate(Income_Group = fct_collapse(
    Annual_Income,
    `Zero/Neg` = "Negative or Zero Income",
    `<50k` = c("$1 - $19,999 per year ($1 - $379 per week)",
               "$20,000 - $29,999 per year ($380 - $579 per week)",
               "$30,000 - $39,999 per year ($580 - $769 per week)",
               "$40,000 - $49,999 per year ($770 - $959 per week)"),
    `50k-100k` = c("$50,000 - $59,999 per year ($960 - $1149 per week)",
                   "$60,000 - $79,999 per year ($1150 - $1529 per week)",
                   "$80,000 - $99,999 per year ($1530 - $1919 per week)"),
    `100k-150k` = c("$100,000 - $124,999 per year ($1920 - $2399 per week)",
                    "$125,000 - $149,999 per year ($2400 - $2879 per week)"),
    `150k+` = c("$150,000 - $199,999 per year ($2880 - $3839 per week)",
                "$200,000 - $249,999 per year ($3840 - $4799 per week)",
                "$250,000 - $299,999 per year ($4800 - $5759 per week)",
                "$300,000 or more per year ($5760 or more per week)")
  ))

ggplot(ecec_chi3) +
  geom_mosaic(aes(x = product(Income_Group), 
                  fill = Gender, 
                  weight = 1),
              color = "white", size = 0.2) +
  theme_minimal() +
  labs(title = "Gender by Income Group",
       x = "Income Group",
       y = "Gender")

How would you interpret the above plot?

Clearly, there are a very small number of observations for our Zero/Negative income category (Try checking this - you should only see 3!). This will likely cause an issue once we run the chi-square test, as the expected count won’t be above 5.

It appears that 50k-100k and 100k-150k are the most common income groups for both men and women, followed by 150k+ and then <50k. We can see that there are many more women who earn under 50k than men. In the 150k+ income group, the proportion of men and women is fairly even.

If we take a look at the number of females and males in our study:

table(ecec_chi$Gender)

Female   Male 
  1249    675 

There are nearly double the number of females compared with males, which means the plot will naturally show more pink overall.
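Because the genders have such different totals, it can help to compare the income distribution within each gender rather than raw counts. A minimal sketch, assuming the ecec_chi3 data frame created for the plot above; prop.table() with margin = 2 converts a two-way table into column proportions:

```r
# Income distribution within each gender, rather than raw counts.
# margin = 2 means each column (gender) is scaled to sum to 1.
income_by_gender <- table(ecec_chi3$Income_Group, ecec_chi3$Gender)
round(prop.table(income_by_gender, margin = 2), 3)
```

Reading down each column now shows what fraction of each gender falls in each income group, which is a fairer comparison than the counts alone.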

Step 3: Run the test

Now try running the chi-square test.

ChiT_ecec <- chisq.test(table(ecec_chi$Annual_Income,ecec_chi$Gender))

You may notice a warning. This typically appears when one of the expected counts is below 5.

ChiT_ecec$expected
                                                       
                                                            Female       Male
  $1 - $19,999 per year ($1 - $379 per week)             24.668399  13.331601
  $100,000 - $124,999 per year ($1920 - $2399 per week) 212.927235 115.072765
  $125,000 - $149,999 per year ($2400 - $2879 per week) 148.010395  79.989605
  $150,000 - $199,999 per year ($2880 - $3839 per week) 195.399688 105.600312
  $20,000 - $29,999 per year ($380 - $579 per week)      45.441788  24.558212
  $200,000 - $249,999 per year ($3840 - $4799 per week)  75.303534  40.696466
  $250,000 - $299,999 per year ($4800 - $5759 per week)  20.773389  11.226611
  $30,000 - $39,999 per year ($580 - $769 per week)      46.740125  25.259875
  $300,000 or more per year ($5760 or more per week)     18.176715   9.823285
  $40,000 - $49,999 per year ($770 - $959 per week)      46.090956  24.909044
  $50,000 - $59,999 per year ($960 - $1149 per week)     70.110187  37.889813
  $60,000 - $79,999 per year ($1150 - $1529 per week)   141.518711  76.481289
  $80,000 - $99,999 per year ($1530 - $1919 per week)   201.891372 109.108628
  Negative or Zero Income                                 1.947505   1.052495
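The small expected counts in the bottom row can be verified by hand: for a two-way table, each cell's expected count is the row total times the column total, divided by the grand total. A sketch using the totals reported in this section (3 'Negative or Zero Income' respondents, 1249 females, 675 males, 1924 in total):

```r
# Expected count = (row total * column total) / grand total
row_total    <- 3      # Negative or Zero Income respondents
female_total <- 1249
male_total   <- 675
grand_total  <- 1924

row_total * female_total / grand_total   # approximately 1.95
row_total * male_total / grand_total     # approximately 1.05
```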

In this case, the expected counts for the ‘Negative or Zero Income’ category are 1.95 (female) and 1.05 (male), well under our condition of \(\geq 5\). To address this, we can combine this category with the next lowest group, ‘$1–19,999 per year’, and create a new category labelled ‘Under $20k’.

ecec_chi <- ecec_chi %>%
  mutate(Annual_Income = fct_collapse(
    Annual_Income,
    `Under $20k` = c("Negative or Zero Income",
                     "$1 - $19,999 per year ($1 - $379 per week)")
  ))

After combining the categories, run the test again.

ChiT_ecec <- chisq.test(table(ecec_chi$Annual_Income,ecec_chi$Gender))
ChiT_ecec

    Pearson's Chi-squared test

data:  table(ecec_chi$Annual_Income, ecec_chi$Gender)
X-squared = 54.962, df = 12, p-value = 1.838e-07

Voila! No warnings are produced, which suggests the expected-count condition is now met. You can check the expected counts to verify.

ChiT_ecec$expected
                                                       
                                                           Female       Male
  Under $20k                                             26.61590  14.384096
  $100,000 - $124,999 per year ($1920 - $2399 per week) 212.92723 115.072765
  $125,000 - $149,999 per year ($2400 - $2879 per week) 148.01040  79.989605
  $150,000 - $199,999 per year ($2880 - $3839 per week) 195.39969 105.600312
  $20,000 - $29,999 per year ($380 - $579 per week)      45.44179  24.558212
  $200,000 - $249,999 per year ($3840 - $4799 per week)  75.30353  40.696466
  $250,000 - $299,999 per year ($4800 - $5759 per week)  20.77339  11.226611
  $30,000 - $39,999 per year ($580 - $769 per week)      46.74012  25.259875
  $300,000 or more per year ($5760 or more per week)     18.17672   9.823285
  $40,000 - $49,999 per year ($770 - $959 per week)      46.09096  24.909044
  $50,000 - $59,999 per year ($960 - $1149 per week)     70.11019  37.889813
  $60,000 - $79,999 per year ($1150 - $1529 per week)   141.51871  76.481289
  $80,000 - $99,999 per year ($1530 - $1919 per week)   201.89137 109.108628

Now that the chi-square assumptions have been met and the test is valid, try interpreting the output. Look at the p-value and decide whether to reject or not reject the null hypothesis, and explain what this result means in the context of the research question.

Our output (\(\chi^2 = 54.962\), \(df = 12\), \(p = 1.838\times10^{-7}\)) provides us with enough evidence to reject the null hypothesis and conclude that annual income and gender are associated for carers across NSW.
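As a quick check on the output, the degrees of freedom and p-value can be reproduced directly. A sketch using the reported statistic; for a test of independence, df = (rows - 1) × (columns - 1), and here the collapsed income variable has 13 levels while gender has 2:

```r
# Degrees of freedom for a test of independence
df <- (13 - 1) * (2 - 1)
df   # 12

# The p-value is the upper-tail probability of the chi-square distribution
pchisq(54.962, df = 12, lower.tail = FALSE)   # approximately 1.84e-07
```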

You can also take a look at the standardised residuals.

ChiT_ecec$stdres
                                                       
                                                             Female        Male
  Under $20k                                             2.77340715 -2.77340715
  $100,000 - $124,999 per year ($1920 - $2399 per week) -1.76923708  1.76923708
  $125,000 - $149,999 per year ($2400 - $2879 per week) -0.44495658  0.44495658
  $150,000 - $199,999 per year ($2880 - $3839 per week) -2.81409940  2.81409940
  $20,000 - $29,999 per year ($380 - $579 per week)      2.69377297 -2.69377297
  $200,000 - $249,999 per year ($3840 - $4799 per week) -1.06441668  1.06441668
  $250,000 - $299,999 per year ($4800 - $5759 per week)  0.08464871 -0.08464871
  $30,000 - $39,999 per year ($580 - $769 per week)      3.84095076 -3.84095076
  $300,000 or more per year ($5760 or more per week)    -2.46396134  2.46396134
  $40,000 - $49,999 per year ($770 - $959 per week)      3.01775862 -3.01775862
  $50,000 - $59,999 per year ($960 - $1149 per week)    -0.02286838  0.02286838
  $60,000 - $79,999 per year ($1150 - $1529 per week)    0.97682805 -0.97682805
  $80,000 - $99,999 per year ($1530 - $1919 per week)   -0.37521579  0.37521579
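To pick out the categories driving the association programmatically, you could filter the residual table for absolute values above 2. A sketch, assuming the ChiT_ecec object from the test above:

```r
# Rows where the standardised residual exceeds 2 in magnitude are the
# income groups whose gender split departs most from independence.
res <- ChiT_ecec$stdres
res[abs(res[, "Female"]) > 2, ]
```

From the residuals above, females are over-represented in the lower income brackets (under $50,000) and under-represented in the $150,000–$199,999 and $300,000-or-more brackets, relative to what independence would predict.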